This is the first in a series of notebooks designed to show you how to analyze social media data. We assume you have already downloaded the data and are now ready to begin examining it. In this first notebook I will show you how to set up your ipython working environment and import the Twitter data we have downloaded.
First, we will import several necessary Python packages. We will be using the Python Data Analysis Library, or PANDAS, extensively for our data manipulations. It is invaluable for analyzing datasets.
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series
PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns. Therefore, I typically include this line at the top of all my notebooks.
#Set PANDAS to show all columns in DataFrame
pd.set_option('display.max_columns', None)
We can check which version of various packages we're using. You can see I'm running PANDAS 0.13 here.
print pd.__version__
0.13.1
PANDAS can read in data from a variety of different data types. If you've followed some of my earlier tutorials you have downloaded tweets into an SQLite database, then converted to a CSV file. That's what we have here. We have a set of tweets by Fortune 200 firms. So, in the following three lines we'll first import the CSV file and assign it to the name 'df' -- short for 'dataframe', the PANDAS name for a dataset. Second, we'll use the len function to see how many rows (tweets) there are in the dataset; there are 34,097 tweets in total. Finally, we will use the head function to show the first two rows of the dataset.
df = pd.read_csv('CSR_user_timeline_2013.csv', sep=',', low_memory=False)
print len(df)
df.head(2)
34097
rowid | query | tweet_id | tweet_id_str | inserted_date | truncated | language | possibly_sensitive | coordinates | retweeted_status | withheld_in_countries | withheld_scope | created_at_text | created_at | month | year | content | from_user_screen_name | from_user_id | from_user_followers_count | from_user_friends_count | from_user_listed_count | from_user_favourites_count | from_user_statuses_count | from_user_description | from_user_location | from_user_created_at | retweet_count | favorite_count | entities_urls | entities_urls_count | entities_hashtags | entities_hashtags_count | entities_mentions | entities_mentions_count | in_reply_to_screen_name | in_reply_to_status_id | source | entities_expanded_urls | entities_media_count | media_expanded_url | media_url | media_type | video_link | photo_link | twitpic | num_characters | num_words | retweeted_user | retweeted_user_description | retweeted_user_screen_name | retweeted_user_followers_count | retweeted_user_listed_count | retweeted_user_statuses_count | retweeted_user_location | retweeted_tweet_created_at | Fortune_2012_rank | Company | CSR_sustainability | specific_project_initiative_area | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 67340 | humanavitality | 306897327585652736 | 306897327585652736 | 2014-03-09 13:46:50.222857 | 0 | en | NaN | NaN | NaN | NaN | NaN | Wed Feb 27 22:43:19 +0000 2013 | 2013-02-27 22:43:19.000000 | 2 | 2013 | @louloushive (Tweet 2) We encourage other empl... | humanavitality | 274041023 | 2859 | 440 | 38 | 25 | 1766 | This is the official Twitter account for Human... | NaN | Tue Mar 29 16:23:02 +0000 2011 | 0 | 0 | NaN | 0 | NaN | 0 | louloushive | 1 | louloushive | 3.062183e+17 | web | NaN | NaN | NaN | NaN | NaN | 0 | 0 | 0 | 121 | 19 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 79 | Humana | 0 | 1 |
1 | 39454 | FundacionPfizer | 308616393706844160 | 308616393706844160 | 2014-03-09 13:38:20.679967 | 0 | es | NaN | NaN | NaN | NaN | NaN | Mon Mar 04 16:34:17 +0000 2013 | 2013-03-04 16:34:17.000000 | 3 | 2013 | ¿Sabes por qué la #vacuna contra la #neumonía ... | FundacionPfizer | 188384056 | 2464 | 597 | 50 | 11 | 2400 | Noticias sobre Responsabilidad Social y Fundac... | México | Wed Sep 08 16:14:11 +0000 2010 | 1 | 0 | NaN | 0 | vacuna, neumonía | 2 | NaN | 0 | NaN | NaN | web | NaN | NaN | NaN | NaN | NaN | 0 | 0 | 0 | 138 | 20 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40 | Pfizer | 0 | 1 |
2 rows × 60 columns
List all the columns in the DataFrame
df.columns
Index([u'rowid', u'query', u'tweet_id', u'tweet_id_str', u'inserted_date', u'truncated', u'language', u'possibly_sensitive', u'coordinates', u'retweeted_status', u'withheld_in_countries', u'withheld_scope', u'created_at_text', u'created_at', u'month', u'year', u'content', u'from_user_screen_name', u'from_user_id', u'from_user_followers_count', u'from_user_friends_count', u'from_user_listed_count', u'from_user_favourites_count', u'from_user_statuses_count', u'from_user_description', u'from_user_location', u'from_user_created_at', u'retweet_count', u'favorite_count', u'entities_urls', u'entities_urls_count', u'entities_hashtags', u'entities_hashtags_count', u'entities_mentions', u'entities_mentions_count', u'in_reply_to_screen_name', u'in_reply_to_status_id', u'source', u'entities_expanded_urls', u'entities_media_count', u'media_expanded_url', u'media_url', u'media_type', u'video_link', u'photo_link', u'twitpic', u'num_characters', u'num_words', u'retweeted_user', u'retweeted_user_description', u'retweeted_user_screen_name', u'retweeted_user_followers_count', u'retweeted_user_listed_count', u'retweeted_user_statuses_count', u'retweeted_user_location', u'retweeted_tweet_created_at', u'Fortune_2012_rank', u'Company', u'CSR_sustainability', u'specific_project_initiative_area'], dtype='object')
We can use the len function again here to see how many columns there are in the dataframe: 60.
len(df.columns)
60
We should also inspect the format for our columns. We can see that some are integers, some are 'float' (can have a decimal), and some are 'objects' (text). If you have a identifying text variable that has accidentally been imported as a float, for instance, that could cause problems down the road, so you should fix it before continuing.
df.dtypes
rowid int64 query object tweet_id_str int64 inserted_date object language object coordinates object retweeted_status object created_at object month int64 year int64 content object from_user_screen_name object from_user_id int64 from_user_followers_count int64 from_user_friends_count int64 from_user_listed_count int64 from_user_favourites_count int64 from_user_statuses_count int64 from_user_description object from_user_location object from_user_created_at object retweet_count int64 favorite_count int64 entities_urls object entities_urls_count int64 entities_hashtags object entities_hashtags_count int64 entities_mentions object entities_mentions_count int64 in_reply_to_screen_name object in_reply_to_status_id float64 source object entities_expanded_urls object entities_media_count float64 media_expanded_url object media_url object media_type object video_link int64 photo_link int64 twitpic int64 num_characters int64 num_words int64 retweeted_user float64 retweeted_user_description object retweeted_user_screen_name object retweeted_user_followers_count float64 retweeted_user_listed_count float64 retweeted_user_statuses_count float64 retweeted_user_location object retweeted_tweet_created_at object Fortune_2012_rank int64 Company object CSR_sustainability int64 specific_project_initiative_area int64 Length: 54, dtype: object
Every researcher will have different preferences about keeping or deleting unneeded columns. You might want to delete variables to make your dataset cleaner, you might want to save memory, or you might want a smaller dataset for some specific analyses. In any case, we can use the drop command to delete individual columns. Let's drop six that are not needed here. The first two have duplicate columns in another format, while the latter four all have zero variation (all are blank).
df = df.drop('created_at_text',1)
df = df.drop('tweet_id',1)
df = df.drop('withheld_in_countries',1)
df = df.drop('withheld_scope',1)
df = df.drop('truncated',1)
df = df.drop('possibly_sensitive',1)
There are now 54 columns in the dataframe.
len(df.columns)
54
df.head(2)
rowid | query | tweet_id_str | inserted_date | language | coordinates | retweeted_status | created_at | month | year | content | from_user_screen_name | from_user_id | from_user_followers_count | from_user_friends_count | from_user_listed_count | from_user_favourites_count | from_user_statuses_count | from_user_description | from_user_location | from_user_created_at | retweet_count | favorite_count | entities_urls | entities_urls_count | entities_hashtags | entities_hashtags_count | entities_mentions | entities_mentions_count | in_reply_to_screen_name | in_reply_to_status_id | source | entities_expanded_urls | entities_media_count | media_expanded_url | media_url | media_type | video_link | photo_link | twitpic | num_characters | num_words | retweeted_user | retweeted_user_description | retweeted_user_screen_name | retweeted_user_followers_count | retweeted_user_listed_count | retweeted_user_statuses_count | retweeted_user_location | retweeted_tweet_created_at | Fortune_2012_rank | Company | CSR_sustainability | specific_project_initiative_area | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 67340 | humanavitality | 306897327585652736 | 2014-03-09 13:46:50.222857 | en | NaN | NaN | 2013-02-27 22:43:19.000000 | 2 | 2013 | @louloushive (Tweet 2) We encourage other empl... | humanavitality | 274041023 | 2859 | 440 | 38 | 25 | 1766 | This is the official Twitter account for Human... | NaN | Tue Mar 29 16:23:02 +0000 2011 | 0 | 0 | NaN | 0 | NaN | 0 | louloushive | 1 | louloushive | 3.062183e+17 | web | NaN | NaN | NaN | NaN | NaN | 0 | 0 | 0 | 121 | 19 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 79 | Humana | 0 | 1 |
1 | 39454 | FundacionPfizer | 308616393706844160 | 2014-03-09 13:38:20.679967 | es | NaN | NaN | 2013-03-04 16:34:17.000000 | 3 | 2013 | ¿Sabes por qué la #vacuna contra la #neumonía ... | FundacionPfizer | 188384056 | 2464 | 597 | 50 | 11 | 2400 | Noticias sobre Responsabilidad Social y Fundac... | México | Wed Sep 08 16:14:11 +0000 2010 | 1 | 0 | NaN | 0 | vacuna, neumonía | 2 | NaN | 0 | NaN | NaN | web | NaN | NaN | NaN | NaN | NaN | 0 | 0 | 0 | 138 | 20 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40 | Pfizer | 0 | 1 |
2 rows × 54 columns
If you have only a few columns to delete you can use the drop command as shown above. On the other hand, if you only want to keep a few columns, you can create a new version of the dataframe with only those columns you like. Note that the double square brackets -- "[[...]]" -- in PANDAS forms a dataframe representation. In the following example, I am creating a new dataframe with only three variables. You can see that this new dataframe has the same number of tweets but fewer columns (variables).
df2 = df[['created_at', 'from_user_screen_name', 'retweet_count']]
print len(df2)
df2.head(2)
34097
created_at | from_user_screen_name | retweet_count | |
---|---|---|---|
0 | 2013-02-27 22:43:19.000000 | humanavitality | 0 |
1 | 2013-03-04 16:34:17.000000 | FundacionPfizer | 1 |
2 rows × 3 columns
We can use the unique function to find how many unique Twitter accounts are represented in the dataset. First, I'll show you what unique function does -- it creates an array of all the screen_names of the Twitter accounts.
pd.unique(df.from_user_screen_name.ravel())
array(['humanavitality', 'FundacionPfizer', 'Intelinvolved', 'CiscoCSR', 'IBMSmartCities', 'citizenIBM', 'BofA_Community', 'PG_CSDW', 'mathmovesu', 'PPGIdeascapes', 'gehealthy', 'msftcitizenship', 'TICalculators', 'CiscoEDU', 'DuPont_ability', 'Dell4Good', 'verizongiving', 'DellEDU', 'SprintGreenNews', 'TeachingMoney', 'WalmartGreen', 'ecomagination', 'WalmartAction', 'Microsoft_Green', 'Comcastdreambig', 'DE_Youtility', 'hpglobalcitizen', 'CitizenDisney', 'AmgenFoundation', 'ATTAspire', 'AlcoaFoundation', 'nikebetterworld', 'HoneywellBuild', 'googlestudents', '3M_FoodSafety', 'HeartRescue', 'EnviroSears', 'ClickToEmpower', 'GreenIBM', 'PromesaPepsiCo', 'FordDriveGreen', 'JNJStories'], dtype=object)
Note again how we can use len to find out how many accounts are in the array: 42
len(pd.unique(df.from_user_screen_name.ravel()))
42
We want to get rid of all tweets by TICalculators from the dataframe. Unlike the other 41 Twitter accounts in the dataset, this account is not a CSR-related account. First, we can use the len function combined with a dataframe query to count the number of tweets that are not sent by TICalculators: 32,300
len(df[df['from_user_screen_name'] != 'TICalculators'])
32330
We should then also check how many tweets are sent by TICalculators: 1,767
len(df[df['from_user_screen_name'] == 'TICalculators'])
1767
We can use Python to do "math." Let's use this to show whether the two numbers returned in the above steps add up to the total number of tweets in our dataframe. They do. While this may seem like an unnecessary step, it is always critical to perform such basic data checks in order to avert unexpected data disasters.
1767 + 32330
34097
We can also do this another way
(1767 + 32330) - len(df)
0
Or even
len(df[df['from_user_screen_name'] != 'TICalculators']) + len(df[df['from_user_screen_name'] == 'TICalculators']) - len(df)
0
In the next block of code we will create a new version of our dataframe, this time limiting it to only those tweets that are not sent by TICalculators. As we can see, there are now 32,300 tweets in this dataframe, the same number as we calculated above.
df = df[df['from_user_screen_name'] != 'TICalculators']
print len(df)
df.head(2)
32330
rowid | query | tweet_id_str | inserted_date | language | coordinates | retweeted_status | created_at | month | year | content | from_user_screen_name | from_user_id | from_user_followers_count | from_user_friends_count | from_user_listed_count | from_user_favourites_count | from_user_statuses_count | from_user_description | from_user_location | from_user_created_at | retweet_count | favorite_count | entities_urls | entities_urls_count | entities_hashtags | entities_hashtags_count | entities_mentions | entities_mentions_count | in_reply_to_screen_name | in_reply_to_status_id | source | entities_expanded_urls | entities_media_count | media_expanded_url | media_url | media_type | video_link | photo_link | twitpic | num_characters | num_words | retweeted_user | retweeted_user_description | retweeted_user_screen_name | retweeted_user_followers_count | retweeted_user_listed_count | retweeted_user_statuses_count | retweeted_user_location | retweeted_tweet_created_at | Fortune_2012_rank | Company | CSR_sustainability | specific_project_initiative_area | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 67340 | humanavitality | 306897327585652736 | 2014-03-09 13:46:50.222857 | en | NaN | NaN | 2013-02-27 22:43:19.000000 | 2 | 2013 | @louloushive (Tweet 2) We encourage other empl... | humanavitality | 274041023 | 2859 | 440 | 38 | 25 | 1766 | This is the official Twitter account for Human... | NaN | Tue Mar 29 16:23:02 +0000 2011 | 0 | 0 | NaN | 0 | NaN | 0 | louloushive | 1 | louloushive | 3.062183e+17 | web | NaN | NaN | NaN | NaN | NaN | 0 | 0 | 0 | 121 | 19 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 79 | Humana | 0 | 1 |
1 | 39454 | FundacionPfizer | 308616393706844160 | 2014-03-09 13:38:20.679967 | es | NaN | NaN | 2013-03-04 16:34:17.000000 | 3 | 2013 | ¿Sabes por qué la #vacuna contra la #neumonía ... | FundacionPfizer | 188384056 | 2464 | 597 | 50 | 11 | 2400 | Noticias sobre Responsabilidad Social y Fundac... | México | Wed Sep 08 16:14:11 +0000 2010 | 1 | 0 | NaN | 0 | vacuna, neumonía | 2 | NaN | 0 | NaN | NaN | web | NaN | NaN | NaN | NaN | NaN | 0 | 0 | 0 | 138 | 20 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40 | Pfizer | 0 | 1 |
2 rows × 54 columns
Now let's check again for all the unique accounts in the dataframe -- as you can see, TICalculators is gone and there are now 41 accounts.
pd.unique(df.from_user_screen_name.ravel())
array(['humanavitality', 'FundacionPfizer', 'Intelinvolved', 'CiscoCSR', 'IBMSmartCities', 'citizenIBM', 'BofA_Community', 'PG_CSDW', 'mathmovesu', 'PPGIdeascapes', 'gehealthy', 'msftcitizenship', 'CiscoEDU', 'DuPont_ability', 'Dell4Good', 'verizongiving', 'DellEDU', 'SprintGreenNews', 'TeachingMoney', 'WalmartGreen', 'ecomagination', 'WalmartAction', 'Microsoft_Green', 'Comcastdreambig', 'DE_Youtility', 'hpglobalcitizen', 'CitizenDisney', 'AmgenFoundation', 'ATTAspire', 'AlcoaFoundation', 'nikebetterworld', 'HoneywellBuild', 'googlestudents', '3M_FoodSafety', 'HeartRescue', 'EnviroSears', 'ClickToEmpower', 'GreenIBM', 'PromesaPepsiCo', 'FordDriveGreen', 'JNJStories'], dtype=object)
len(pd.unique(df.from_user_screen_name.ravel()))
41
We now have our new dataframe without TICalculators. We are left with 32,330 tweets sent by 41 companies over the course of 2013.
We will now save the dataframe in PANDAS' native format. It's called 'pickling' a file, so we'll give it the typical 'pkl' extension.
df.to_pickle('CSR tweets - 2013 by 41 accounts.pkl')
For more Notebooks as well as additional Python and Big Data tutorials, please visit http://social-metrics.org or follow me on Twitter @gregorysaxton